Accessibility:

Alt text: set it in the chunk header, e.g. {r, fig.alt = "your alt text"}

Colour-blind-friendly palettes: scale_colour_viridis_d() or scale_fill_viridis_d()
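A minimal sketch of how the two accessibility notes might fit together in one plot. The data frame `toy` and its numbers are made up for illustration, and the chunk label `category-bar` is a hypothetical name; in the .Rmd file the alt text would go in the chunk header through the `fig.alt` option, shown here only as a comment.

```r
library(ggplot2)

# Hypothetical toy data, NOT taken from the YouTube dataset
toy <- data.frame(
  category = c("Music", "Comedy", "Sports"),
  n_videos = c(12, 7, 5)
)

# In the .Rmd file the chunk header would carry the alt text, e.g.
#   {r category-bar, fig.alt = "Bar chart of video counts for three categories"}

p <- ggplot(toy, aes(x = category, y = n_videos, fill = category)) +
  geom_col() +
  scale_fill_viridis_d() +  # discrete viridis palette for the fill aesthetic
  labs(x = "Category", y = "Number of videos", fill = "Category")

p
```

For points or lines coloured by a discrete variable, `scale_colour_viridis_d()` plays the same role for the `colour` aesthetic.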

Write-up: a README description of the project that points out the most important findings (similar to https://dcs-210.github.io/project-matt-annie-ethan/)

1. Introduction

We aim to examine what makes a YouTube video popular, using a dataset of YouTube trending-video statistics.

The data comes from Kaggle (https://www.kaggle.com/datasnaek/youtube-new?select=USvideos.csv), where it was posted by a user called Mitchell J. Mitchell J built the dataset by writing a Python script that scraped the web for YouTube data (the scraper code is at https://github.com/mitchelljy/Trending-YouTube-Scraper). The Kaggle post includes YouTube statistics for the USA, Great Britain, Germany, Canada, France, Russia, Mexico, South Korea, Japan, and India, with each country's data posted as a separate dataset. We decided to look only at the US data, for two reasons.

Reason 1: combined, the datasets would have been enormous, and processing that many megabytes of data would likely have slowed our code. Reason 2: the US is the country we know best. As American researchers who can relate to American search habits, we expect to have an easier time designing analyses of US data and drawing conclusions from them than we would with worldwide data.

There are 16 columns and 40,949 rows in the dataset.

Below is a list of the variables and what they mean:

- video_id: a unique alphanumeric ID for each video (there are only 6351 unique videos in the data)
- trending_date: a date on which the video was trending
- title: title of the video (a string)
- channel_title: name of the YouTube channel (a string)
- category_id: number corresponding to the category of the video (ranging from 1 to 43)
- publish_time: a datetime for when the video was uploaded to YouTube
- tags: categorical tag names attached to the video
- views: number of views
- likes: number of likes
- dislikes: number of dislikes
- comment_count: number of comments
- thumbnail_link: a link to the video's thumbnail image
- comments_disabled: boolean for whether or not comments are disabled
- ratings_disabled: boolean for whether or not likes and dislikes are disabled
- video_error_or_removed: boolean for whether or not the video has been taken off the site
- description: description of the video

2. Data

## Rows: 40,949
## Columns: 16
## $ video_id               <chr> "2kyS6SvSYSE", "1ZAPwfrtAFY", "5qpjK5DgCt4", "p…
## $ trending_date          <chr> "17.14.11", "17.14.11", "17.14.11", "17.14.11",…
## $ title                  <chr> "WE WANT TO TALK ABOUT OUR MARRIAGE", "The Trum…
## $ channel_title          <chr> "CaseyNeistat", "LastWeekTonight", "Rudy Mancus…
## $ category_id            <dbl> 22, 24, 23, 24, 24, 28, 24, 28, 1, 25, 17, 24, …
## $ publish_time           <dttm> 2017-11-13 17:13:01, 2017-11-13 07:30:00, 2017…
## $ tags                   <chr> "SHANtell martin", "last week tonight trump pre…
## $ views                  <dbl> 748374, 2418783, 3191434, 343168, 2095731, 1191…
## $ likes                  <dbl> 57527, 97185, 146033, 10172, 132235, 9763, 1599…
## $ dislikes               <dbl> 2966, 6146, 5339, 666, 1989, 511, 2445, 778, 11…
## $ comment_count          <dbl> 15954, 12703, 8181, 2146, 17518, 1434, 1970, 34…
## $ thumbnail_link         <chr> "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg…
## $ comments_disabled      <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ ratings_disabled       <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ video_error_or_removed <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE…
## $ description            <chr> "SHANTELL'S CHANNEL - https://www.youtube.com/s…
##  [1] 22 24 23 28  1 25 17 10 15 27 26  2 19 20 29 43
## # A tibble: 6 × 16
##   video_id    trending_date title  channel_title category_id publish_time       
##   <chr>       <chr>         <chr>  <chr>               <dbl> <dttm>             
## 1 2kyS6SvSYSE 17.14.11      WE WA… CaseyNeistat           22 2017-11-13 17:13:01
## 2 1ZAPwfrtAFY 17.14.11      The T… LastWeekToni…          24 2017-11-13 07:30:00
## 3 5qpjK5DgCt4 17.14.11      Racis… Rudy Mancuso           23 2017-11-12 19:05:24
## 4 puqaWrEC7tY 17.14.11      Nicke… Good Mythica…          24 2017-11-13 11:00:04
## 5 d380meD0W0M 17.14.11      I Dar… nigahiga               24 2017-11-12 18:01:41
## 6 gHZ1Qz0KiKM 17.14.11      2 Wee… iJustine               28 2017-11-13 19:07:23
## # … with 10 more variables: tags <chr>, views <dbl>, likes <dbl>,
## #   dislikes <dbl>, comment_count <dbl>, thumbnail_link <chr>,
## #   comments_disabled <lgl>, ratings_disabled <lgl>,
## #   video_error_or_removed <lgl>, description <chr>

We noticed that our dataset contains a variable called category_id, but the variable doesn't tell us what the categories are. Searching online, we found YouTube's list of video categories, and the 16 unique values returned by running unique() on category_id match up with IDs in that list. The list of categories can be found here: https://techpostplus.com/youtube-video-categories-list-faqs-and-solutions/. It would be interesting to see how popularity changes with category. To make visualizations by category possible, we will add a new variable to our dataset called category, giving the name that corresponds to each category_id.
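One way the category variable could be added is with a small lookup table and a join. This is a sketch, not necessarily the code we ran: `category_lookup` and `yt_toy` are hypothetical names, `yt_toy` stands in for the full data frame, and the category names follow YouTube's public category list as we understand it, so they should be checked against the linked page.

```r
library(dplyr)

# Lookup table for the 16 category_id values that appear in the data.
# Names are assumed from YouTube's public category list; verify before use.
category_lookup <- tibble(
  category_id = c(1, 2, 10, 15, 17, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 43),
  category = c(
    "Film & Animation", "Autos & Vehicles", "Music", "Pets & Animals",
    "Sports", "Travel & Events", "Gaming", "People & Blogs", "Comedy",
    "Entertainment", "News & Politics", "Howto & Style", "Education",
    "Science & Technology", "Nonprofits & Activism", "Shows"
  )
)

# Stand-in for the real data: the category_id values of the first six rows
yt_toy <- tibble(category_id = c(22, 24, 23, 24, 24, 28))

# left_join() keeps every video row and attaches the matching category name
yt_toy <- left_join(yt_toy, category_lookup, by = "category_id")
yt_toy
```

A join keeps the ID-to-name mapping in one table, which is easier to check against the source list than a long chain of conditions inside mutate().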

## # A tibble: 6 × 17
##   video_id    trending_date title  channel_title category_id publish_time       
##   <chr>       <chr>         <chr>  <chr>               <dbl> <dttm>             
## 1 2kyS6SvSYSE 17.14.11      WE WA… CaseyNeistat           22 2017-11-13 17:13:01
## 2 1ZAPwfrtAFY 17.14.11      The T… LastWeekToni…          24 2017-11-13 07:30:00
## 3 5qpjK5DgCt4 17.14.11      Racis… Rudy Mancuso           23 2017-11-12 19:05:24
## 4 puqaWrEC7tY 17.14.11      Nicke… Good Mythica…          24 2017-11-13 11:00:04
## 5 d380meD0W0M 17.14.11      I Dar… nigahiga               24 2017-11-12 18:01:41
## 6 gHZ1Qz0KiKM 17.14.11      2 Wee… iJustine               28 2017-11-13 19:07:23
## # … with 11 more variables: tags <chr>, views <dbl>, likes <dbl>,
## #   dislikes <dbl>, comment_count <dbl>, thumbnail_link <chr>,
## #   comments_disabled <lgl>, ratings_disabled <lgl>,
## #   video_error_or_removed <lgl>, description <chr>, category <chr>

3. Data analysis plan

## [1] "2017-11-14" "2018-06-14"
## [1] "2017-11-14" "2018-06-14"
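A date range like the one printed above can be obtained by parsing trending_date, which is stored as text in year.day.month order (for example "17.14.11"). The sketch below uses two raw values: the first appears in the data, while the second is our assumption of how the last date in the range would be written in the same format.

```r
# trending_date is text in yy.dd.mm order, so the format string is "%y.%d.%m"
trending_raw <- c(
  "17.14.11",  # first trending date, as shown in the data
  "18.14.06"   # assumed raw form of the last date in the range
)

trending <- as.Date(trending_raw, format = "%y.%d.%m")

# Earliest and latest trending dates
range(trending)
```

Getting the day and month the right way round matters here: a format of "%y.%m.%d" would turn "17.14.11" into NA, because there is no month 14.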

There are a number of variables that we could choose as our dependent variable. The two we find most interesting are the number of likes and the number of views.

We can play around with some or all of our predictor variables in order to investigate what factors lead to a popular YouTube video. Videos that have many tags? Videos in a certain category? Videos on a channel that is already popular? Videos posted at a certain time of day? There are many fun questions that we may ask as we progress through the analysis. Of the 14 variables other than views and likes (which will be used as dependent variables), all will probably be useful at some point in our analysis except for thumbnail_link (we won't be doing any web scraping ourselves). comment_count, tags, category, channel_title, publish_time, and others will certainly be interesting to examine once we dive into the data. Number of dislikes is an especially juicy variable: what if increased dislikes (indicating increased controversy) are tied to an increase in number of views?

We may want to create our own variables at certain points in our analysis using the mutate() function. Further, we may want to create a number of plots to visualize our data using ggplot2's geom_point(), geom_bar(), and other functions. We may want to pick out certain columns using select(), or certain rows using filter(). We may want to count the occurrence of certain data trends using count(). We may want to sort the data using arrange(). There are many other functions that we have learned (and new ones that we will learn) that will be helpful for this project.
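To make the plan above concrete, here is a small sketch of those dplyr verbs on a toy table. The four rows reuse values from the first rows of the data shown earlier; `yt_toy`, `like_ratio`, `top_videos`, and `per_category` are hypothetical names chosen for illustration.

```r
library(dplyr)

# Toy rows built from values in the first rows of the dataset
yt_toy <- tibble(
  channel_title = c("CaseyNeistat", "LastWeekTonight", "Rudy Mancuso", "nigahiga"),
  category_id   = c(22, 24, 23, 24),
  views         = c(748374, 2418783, 3191434, 2095731),
  likes         = c(57527, 97185, 146033, 132235),
  dislikes      = c(2966, 6146, 5339, 1989)
)

# mutate(): derive a new variable, here the share of ratings that are likes
yt_toy <- yt_toy %>%
  mutate(like_ratio = likes / (likes + dislikes))

# filter() keeps rows, select() keeps columns, arrange() sorts
top_videos <- yt_toy %>%
  filter(views > 1e6) %>%
  select(channel_title, views, like_ratio) %>%
  arrange(desc(views))

# count(): number of rows per category_id
per_category <- yt_toy %>%
  count(category_id)

top_videos
per_category
```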

Since we do not yet know which predictor variables (or even which dependent variable) we will use, we cannot yet say what will count as conclusive evidence for our research question. Strong correlation plots and other visualizations that display clear trends will most likely be our “proof” once we come up with some conclusions to include in our final file.
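As a rough idea of what "strong correlation" could mean in practice, the sketch below computes a few correlation coefficients for views and likes using base R's cor(). It uses only the first five rows of the data, so the numbers say nothing about the full dataset; the choice of Pearson, log-scale Pearson, and Spearman is our assumption about what might be worth comparing.

```r
# views and likes for the first five rows of the dataset
views <- c(748374, 2418783, 3191434, 343168, 2095731)
likes <- c(57527, 97185, 146033, 10172, 132235)

# Pearson correlation on the raw counts
r_raw <- cor(views, likes)

# Pearson correlation on a log10 scale, which reduces the pull of very large
# videos (counts of zero would need handling first, since log10(0) is -Inf)
r_log <- cor(log10(views), log10(likes))

# Spearman rank correlation, which only uses the ordering of the values
r_rank <- cor(views, likes, method = "spearman")

c(raw = r_raw, log = r_log, rank = r_rank)
```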

Some visualizations of the data are shown below.

Views vs. Likes

Views vs. Dislikes

Views vs. Comment Count

Views vs. Date Published

Views vs. Number of Characters in Video Title
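The title-length variable for this comparison could be computed with nchar(), as in the sketch below. The first title comes from the data; the second is made up for contrast, and `title_length` is a hypothetical variable name.

```r
titles <- c(
  "WE WANT TO TALK ABOUT OUR MARRIAGE",  # first title in the dataset
  "A made-up short title"                # hypothetical, for contrast
)

# nchar() counts characters, including spaces and punctuation
title_length <- nchar(titles)
title_length
```

In the full data frame the same idea would be a single mutate() call that applies nchar() to the title column.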

Category Analysis

Other Plots

The takeaway from the plot above is that the videos that do garner many views tend to begin trending shortly after they are published.
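The delay between publishing and trending can be measured directly as a number of days. The sketch below does this for the first row of the data; `days_to_trend` is a hypothetical variable name, and treating publish_time as UTC is an assumption.

```r
# First row of the dataset: publish time and raw trending date
publish_time  <- as.POSIXct("2017-11-13 17:13:01", tz = "UTC")  # time zone assumed
trending_date <- as.Date("17.14.11", format = "%y.%d.%m")       # yy.dd.mm text

# Drop the time of day, then subtract to get whole days until trending
days_to_trend <- as.numeric(trending_date - as.Date(publish_time))
days_to_trend
```

Because a video can appear on several trending dates, the earliest trending_date for each video_id would be the natural one to use when applying this to the whole dataset.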